Detecting spelling variants in non-standard texts

نویسنده

  • Fabian Barteld
چکیده

Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling for tomorrow. Consequently, in normalization – the standard approach of dealing with spelling variation – so-called non-standard words are mapped to their corresponding standard words. However, there is not always a corresponding standard word. This can be the case for single types (like emoticons in computermediated communication) or a complete language, e.g. texts from historical languages that did not develop to a standard variety. The approach presented in this thesis proposal deals with spelling variation in absence of reference to a standard. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented, where pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Design and implementation of Persian spelling detection and correction system based on Semantic

Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors.  Also developing Persian tools will provide Persian progr...

متن کامل

The Identification of Spelling Variants in English and German Historical Texts: Manual or Automatic?

The identification of spelling variants in English and German historical texts: manual or automatic? Dawn ARCHER (University of Central Lancashire) Andrea ERNST-GERLACH, Sebastian KEMPKEN, Thomas PILZ (Universität Duisburg-Essen) Paul RAYSON (Lancaster University) The identification of spelling variants in English and German historical texts: manual or automatic?

متن کامل

Generating Search Term Variants for Text Collections with Historic Spellings

In this paper, we describe a new approach for retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in ancient orthography. By applying a spell checker on a corpus of historic texts, we generate a list of candidate terms for which the contemporary spellings have ...

متن کامل

A Multi-Agent System for Detecting and Correcting "Hidden" Spelling Errors in Arabic Texts

: In this paper, we address the problem of detecting and correcting hidden spelling errors in Arabic texts. Hidden spelling errors are morphologically valid words and therefore they cannot be detected or corrected by conventional spell checking programs. In the work presented here, we investigate this kind of errors as they relate to the Arabic language. We start by proposing a classification o...

متن کامل

Using Comparable Collections of Historical Texts for Building a Diachronic Dictionary for Spelling Normalization

In this paper, we argue that comparable collections of historical written resources can help overcoming typical challenges posed by heritage texts enhancing spelling normalization, POS-tagging and subsequent diachronic linguistic analyses. Thus, we present a comparable corpus of historical German recipes and show how such a comparable text collection together with the application of innovative ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017